Abstract

A Markov decision process is a generalization of a Markov chain in which both
probabilistic and nondeterministic choice coexist. Given a Markov decision process with
costs associated with the transitions and a set of target states, the stochastic shortest
path problem consists in computing the minimum expected cost of a control strategy
that is guaranteed to reach the target. In this paper, we consider the classes of
stochastic shortest path problems in which the costs are all non-negative, or all non-positive.
Previously, these two classes of problems could be solved only under the assumption
that the policies that minimize or maximize the expected cost also lead to the target
with probability 1. This assumption does not necessarily hold for Markov decision
processes that arise as models for distributed probabilistic systems. We present
efficient methods for solving these two classes of problems without relying on additional
assumptions. The methods are based on algorithms to transform the original problems
into problems that satisfy the required assumptions. The methods lead to the efficient
solution of two basic problems in the analysis of the reliability and performance of
partially-specified systems: the computation of the minimum (or maximum)
probability of reaching a target set, and the computation of the minimum (or maximum)
expected time to reach the set.


1  Introduction 

Markov decision processes are generalizations of Markov chains in which probabilistic
choice coexists with nondeterministic choice [Bel57]. Several models of distributed
probabilistic systems are based either on Markov decision processes [BdA95, KB98] or on closely
related formalisms, such as the concurrent Markov chains of [Var85], the probabilistic
automata of [SL94, WSS94], and the timed probabilistic automata of [Seg95]. Several
models based on process algebras are also closely related to Markov decision processes
[LS89, JL91, WSS94]. In these proposals, probability enables the modeling of phenomena
related to reliability and performance, while nondeterminism has been used to model
concurrency [Var85, PZ86, Seg95], inputs [Seg95], imprecise knowledge of the transition
probabilities [dA97, dA98], and in general any behavior for which probabilistic information
is not known.

* An abbreviated version of this paper is to appear in Proceedings of CONCUR 99: Concurrency Theory.

A Markov decision process (MDP) consists of a set of states; with each state is
associated a set of possible actions. At every state, the choice of the next action is
nondeterministic; once chosen, the action determines the transition probability distribution
for the successor state. In order to quantify the probabilistic properties of an MDP, the
concept of policy is introduced [Der70], related to the schedulers of [Var85, PZ86] and to
the adversaries of [SL94, Seg95]. A policy is a criterion for selecting the actions during
a behavior of the system; once the policy is fixed, the MDP is reduced to a conventional
stochastic process. A simple way to introduce time in these models is to associate with
each pair consisting of a state and a related action the time (or the expected time) spent
at the state when the action is selected [Han94, Seg95, dA98]. One of the basic questions
we can ask about the timing behavior of such a system is the expected time needed to
reach a given set of target states from a specified starting state. Being able to answer
this question opens the way to the automated verification of system properties such as
expected time to failure, expected task completion time, and several others. Since the
system model includes nondeterminism, the answer to this expected time question consists
not in a single value, but rather in a range of values between a minimum and a
maximum, depending on whether the policy in use hastens or delays the reaching of the
target. This paper is concerned with the question of how to compute these minimum and
maximum values.

The  problem  of  computing  the  maximum  and  minimum  reachability  times  can  be 
reduced  to  the  stochastic  shortest  path  (SSP)  problem  [EZ62,  Der70].  In  the  statement 
of the SSP problem, a real-valued cost is associated with each state-action pair; the SSP
problem consists in computing the minimum expected cost incurred to reach a set of
target states. Hence, to compute the minimum (resp. maximum) reachability time, it
suffices to equate the cost to the time (resp. to the time multiplied by $-1$) and to solve
the  resulting  SSP  problem.  However,  previous  solutions  to  the  SSP  problem  rely  on 
assumptions  that  do  not  necessarily  hold  for  the  SSP  problems  obtained  by  the  above 
reduction.  In  particular,  previous  solutions  require  that  the  target  set  can  be  reached 
with  probability  1  from  every  state,  and  that  either  (a)  every  policy  that  does  not  lead 
to  the  target  with  probability  1  yields  infinite  expected  total  cost,  or  (b)  the  policies  that 
minimize  or  maximize  the  expected  total  cost  also  lead  to  the  target  with  probability  1 
[BT91,  Ber95].  Under  either  one  of  these  assumptions,  the  goal  of  reaching  the  target  can 
be  disregarded  in  the  solution  of  the  optimization  problem,  and  the  SSP  problem  can  be 
solved  by  determining  the  policy  that  minimizes  the  total  cost.  If  the  starting  and  target 
states  are  part  of  a  formal  specification,  or  if  the  time  associated  with  state-action  pairs 
can  be  0,  as  in  [Han94,  Seg95,  dA97,  dA98],  these  assumptions  do  not  hold  in  general, 
and  new  solution  methods  are  required. 

The aim of this paper is to present methods for solving the SSP problem that rely on
the assumptions that the costs are all non-negative, or all non-positive. We call the SSP
problems that satisfy these assumptions the non-negative and non-positive SSP problems.
Solving these SSP problems suffices for solving the original problem about the maximum
and minimum reachability times. Furthermore, we show that the proposed solution
methods can be applied to the efficient computation of the maximum and minimum probability
of reaching a target set of states.

The  minimum  expected  cost  to  reach  a  set  of  target  states  is  well  defined  only  if 
the  target  can  be  reached  with  probability  1.  The  first  step  in  the  solution  of  the  SSP 
problem  consists  thus  in  computing  the  set  of  states  from  which  the  target  set  can  be 
reached  with  probability  1.  This  problem  can  be  solved  in  polynomial  time  by  a  reduction 
to  linear  programming  [Der70].  In  this  paper  we  present  a  more  efficient  algorithm,  that 
solves  the  problem  in  time  quadratic  in  the  size  of  the  MDP,  and  that  does  not  require 
numerical  computation.  The  algorithm,  originating  from  [dA97],  is  related  to  an  algorithm 
for  solving  two-person  reachability  games  presented  in  [dAHK98]. 

Once  we  have  determined  the  states  from  which  the  target  set  cannot  be  reached  with 
probability  1,  we  present  two  methods  for  solving  the  SSP  problem  on  the  remaining  states. 
First, we show that non-negative and non-positive SSP problems can be solved using
linear programming over the extended field $\mathbb{R} \cup \{\pm\infty\}$. Second, we present translation
algorithms that transform non-negative and non-positive SSP problems into SSP problems
that satisfy the assumptions previously considered in the literature [BT91, Ber95]. This
enables the use of several well-known techniques for the solution of non-negative and
non-positive SSP problems, such as value iteration methods, and methods based on learning
and sample path analysis (see [BT91, Ber95] again). The translation algorithms have
strongly-polynomial  time  complexity  in  the  size  of  the  MDP  being  translated.  As  the 
algorithms  never  increase  and  often  reduce  the  size  of  the  MDPs,  they  also  perform  a 
beneficial  pre-conditioning  prior  to  the  application  of  numerical  solution  methods. 

Finally,  we  apply  the  algorithms  presented  in  this  paper  to  the  computation  of  the 
minimum  and  maximum  probability  of  reaching  a  set  of  target  states.  The  computation 
of  the  minimum  reachability  probability  is  useful  for  determining  lower  bounds  for  the 
probability  of  reaching  desirable  system  configurations,  or  of  accomplishing  tasks  from 
given  starting  points.  The  computation  of  the  maximum  reachability  probability  is  one 
of  the  basic  problems  in  probabilistic  verification:  aside  from  being  of  interest  in  its  own 
right,  it  is  at  the  basis  of  the  algorithms  for  the  determination  of  the  maximum  and 
minimum  probability  with  which  a  linear-time  temporal  logic  formula  holds  over  an  MDP 
[CY90,  CY95,  BdA95].  While  the  maximum  reachability  probability  can  be  computed 
with  the  algorithms  of  [CY90],  the  proposed  approach  minimizes  the  size  of  the  numerical 
problem  to  be  solved. 

2  Preliminaries 

A Markov decision process (MDP) is a generalization of a Markov chain in which
nondeterministic choice coexists with probabilistic choice. Markov decision processes are closely
related to the probabilistic automata of [Rab63], to the concurrent Markov chains of [Var85],
and to the simple probabilistic automata of [SL94, Seg95]. To present their definition, given
a countable set $C$ we denote by $\mathcal{D}(C)$ the set of probability distributions over $C$, i.e. the
set of functions $f : C \mapsto [0,1]$ such that $\sum_{x \in C} f(x) = 1$. Given a distribution $f \in \mathcal{D}(C)$,
we indicate its support by $\mathit{Support}(f) = \{x \in C \mid f(x) > 0\}$.

An MDP $\mathcal{M} = (S, \mathit{Acts}, A, p)$ consists of the following components:

• a finite set $S$ of states;

• a finite set $\mathit{Acts}$ of actions;

• a function $A : S \mapsto 2^{\mathit{Acts}}$ that associates with each $s \in S$ a finite set $A(s) \subseteq \mathit{Acts}$ of
actions available at $s$;

• a function $p : S \times \mathit{Acts} \mapsto \mathcal{D}(S)$ that associates with each $s, t \in S$ and $a \in A(s)$ the
probability $p(s,a)(t)$ of a transition from $s$ to $t$ when action $a$ is selected.

A path of the MDP $\mathcal{M}$ is an infinite sequence $\omega : s_0, a_0, s_1, a_1, \ldots$ of alternating states and
actions, such that $s_i \in S$, $a_i \in A(s_i)$ and $p(s_i, a_i)(s_{i+1}) > 0$ for all $i \ge 0$. For $i \ge 0$, the
sequence is constructed by iterating a two-phase selection process. First, an action $a_i \in A(s_i)$
is selected nondeterministically; second, the successor state $s_{i+1}$ is chosen according
to the probability distribution $p(s_i, a_i)$. Given a path $\omega : s_0, a_0, s_1, a_1, \ldots$ and $k \ge 0$, we
denote by $X_k(\omega)$ and $Y_k(\omega)$ its $k$-th state $s_k$ and its $k$-th action $a_k$, respectively. Given a state
$s \in S$ and an action $a \in A(s)$, we also denote by $\mathit{dest}(s,a) = \{t \in S \mid p(s,a)(t) > 0\}$
the set of possible successors of $s$ when $a$ is selected.
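The definitions above can be made concrete with a small data structure. The following is a minimal sketch, assuming a nested-dictionary encoding in which `p[s][a]` stores the distribution $p(s,a)$ and the keys of `p[s]` play the role of $A(s)$; the names `support`, `dest`, `is_well_formed` and the two-state `example` are illustrative choices, not part of the paper.

```python
from typing import Dict, Set

# Hypothetical encoding: p[s][a] is the distribution p(s, a) as a dict
# {successor: probability}; the key set of p[s] plays the role of A(s).
Distribution = Dict[str, float]
MDP = Dict[str, Dict[str, Distribution]]


def support(f: Distribution) -> Set[str]:
    """Support(f): the elements to which f assigns positive probability."""
    return {x for x, pr in f.items() if pr > 0}


def dest(p: MDP, s: str, a: str) -> Set[str]:
    """dest(s, a): the possible successors of s when action a is selected."""
    return support(p[s][a])


def is_well_formed(p: MDP, tol: float = 1e-9) -> bool:
    """Check that every p(s, a) is a probability distribution over S."""
    states = set(p)
    for actions in p.values():
        for f in actions.values():
            if any(pr < 0 for pr in f.values()):
                return False
            if not support(f) <= states:
                return False
            if abs(sum(f.values()) - 1.0) > tol:
                return False
    return True


# Illustrative two-state MDP (an assumption, not taken from the paper).
example: MDP = {
    "s0": {"a": {"s0": 0.5, "s1": 0.5}, "b": {"s1": 1.0}},
    "s1": {"a": {"s1": 1.0}},
}
```

With this encoding, a path corresponds to repeatedly picking a key `a` of `p[s]` and then sampling a successor from `p[s][a]`.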

To be able to talk about the probability of system behaviors, we need to specify the
criteria with which the actions are chosen. To this end, we use the concept of policy [Der70],
closely related to the adversaries of [SL94, Seg95] and to the schedulers of [Var85, PZ86].
A policy $\eta$ is a mapping $\eta : S^+ \mapsto \mathcal{D}(\mathit{Acts})$, which associates with each finite sequence of
states $s_0, s_1, \ldots, s_n \in S^+$ and each $a \in A(s_n)$ the probability $\eta(s_0, \ldots, s_n)(a)$ of choosing
$a$ after following the sequence of states $s_0, \ldots, s_n$. We require that $\eta(s_0, \ldots, s_n)(a) > 0$
implies $a \in A(s_n)$: a policy can choose only among the actions that are available at the
state where the choice is made. We indicate with $\mathit{Pol}$ the set of all policies. We say
that a policy $\eta$ is memoryless if $\eta(s_0, \ldots, s_n)(a) = \eta(s_n)(a)$ for all sequences of states
$s_0, \ldots, s_n \in S^+$ and all $a \in A(s_n)$.

For every state $s \in S$, we denote by $\Omega_s$ the set of paths having $s$ as initial state, and we
let $\mathcal{B}_s \subseteq 2^{\Omega_s}$ be the $\sigma$-algebra of measurable subsets of $\Omega_s$, following the classical definition
of [KSK66]. Under policy $\eta$ the probability of following a finite path prefix $s_0 a_0 s_1 a_1 \cdots s_n$
is $\prod_{i=0}^{n-1} p(s_i, a_i)(s_{i+1}) \, \eta(s_0 \cdots s_i)(a_i)$. These probabilities for prefixes give rise to a unique
probability measure on $\mathcal{B}_s$. We write $\Pr_s^\eta(A)$ to denote the probability of event $A$ in $\Omega_s$
under policy $\eta$, and $E_s^\eta\{f\}$ to denote the expectation of the random function $f$ from state
$s$ under policy $\eta$.

2.1  The  stochastic  shortest  path  problem 

An instance $\Pi = (S, \mathit{Acts}, A, p, R, c, g)$ of the stochastic shortest path problem consists of
an MDP $(S, \mathit{Acts}, A, p)$, together with the additional components $R$, $c$ and $g$:

• $R \subseteq S$ is the set of destination states;

• $c : S \times \mathit{Acts} \mapsto \mathbb{R}$ is the running cost function, which associates with each state
$s \in S \setminus R$ and each action $a \in A(s)$ the cost $c(s,a)$;

• $g : R \mapsto \mathbb{R}$ is the terminal cost function, which associates with each $s \in R$ its terminal
cost $g(s)$.

We say that an instance of the SSP problem is non-negative (resp. non-positive) if $c(s,a) \ge 0$
(resp. $c(s,a) \le 0$) for all $s \in S$ and $a \in A(s)$; note that the sign of $g$ is not relevant for
this definition.

The SSP problem consists in determining the minimum cost of reaching $R$ when
following a policy that reaches $R$ with probability 1, provided such a policy exists. Precisely,
let $T_R(\omega) = \min\{k \mid X_k(\omega) \in R\}$ be the position of first visit of a path in $R$. For all $s \in S$
we denote by $\mathit{Prp}(s) = \{\eta \in \mathit{Pol} \mid \Pr_s^\eta(T_R < \infty) = 1\}$ the set of policies that lead from $s$
to $R$ with probability 1; these policies are the proper policies for $s$. Given a state $s \in S$,
the cost $v_s^\eta$ of a policy $\eta$ is defined by

$$v_s^\eta = E_s^\eta \Bigl\{ g(X_{T_R}) + \sum_{k=0}^{T_R - 1} c(X_k, Y_k) \Bigr\} . \qquad (1)$$

A policy $\eta$ is optimal if $v_s^\eta = v_s^*$ for all $s \in S \setminus R$. With this notation, the SSP problem
consists in:

1. determining the set $Q = \{s \in S \setminus R \mid \mathit{Prp}(s) \neq \emptyset\}$ of states having at least one
proper policy;

2. computing the minimum cost $v_s^* = \inf_{\eta \in \mathit{Prp}(s)} v_s^\eta$ of a proper policy at all $s \in Q$.

Usually,  the  SSP  problem  is  considered  to  consist  only  in  the  second  question,  and  the 
existence  of  at  least  one  proper  policy  for  each  state  is  stated  as  an  assumption.  However, 
when  the  SSP  problem  is  used  to  compute  the  minimum  or  maximum  reachability  times 
between an initial state and a set of target states that are part of a reliability or
performance specification, we cannot assume that the target set can be reached from the initial
state  with  probability  1.  Hence,  in  Section  2.3  we  present  an  algorithm  to  solve  also  this 
first  question.  In  addition,  we  will  characterize  the  optimal  policies  for  non-negative  and 
non-positive  SSP  problems. 

SSP problem and reachability time. In a timed probabilistic system, the timing
behavior of an MDP $(S, \mathit{Acts}, A, p)$ is specified by means of a function $\mathit{time} : S \times \mathit{Acts} \mapsto \mathbb{R}^+$
that associates with each $s \in S$ and $a \in A(s)$ the expected amount of time $\mathit{time}(s,a)$
spent at state $s$ when action $a$ is selected [dA98]. Given a set $R$ of target states, to compute
the minimum (resp. maximum) expected time to reach $R$ it suffices to solve an SSP
problem having cost functions defined by $c(s,a) = \mathit{time}(s,a)$ (resp. $c(s,a) = -\mathit{time}(s,a)$)
and $g(s) = 0$, for all $s \in S$ and $a \in A(s)$. The minimum (resp. maximum) expected time
to reach $R$ from $s \in S \setminus R$ is then given by $v_s^*$ (resp. $-v_s^*$).
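The reduction from reachability times to SSP costs amounts to a sign choice, which the following sketch spells out. It assumes the time function is given as a dictionary keyed by (state, action) pairs; the function names are hypothetical.

```python
from typing import Dict, Tuple

StateAction = Tuple[str, str]


def reachability_time_costs(time: Dict[StateAction, float],
                            maximize: bool = False) -> Dict[StateAction, float]:
    """Running costs c(s, a) for the SSP instance: time(s, a) for the
    minimum-time problem, -time(s, a) for the maximum-time problem."""
    sign = -1.0 if maximize else 1.0
    return {sa: sign * t for sa, t in time.items()}


def expected_time_from_value(v_star: float, maximize: bool = False) -> float:
    """Recover the expected time from the optimal SSP cost v*_s."""
    return -v_star if maximize else v_star
```

The minimum-time instance is non-negative and the maximum-time instance is non-positive, which is why these two classes of SSP problems suffice for the timing questions.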

2.2  End  components 

The algorithms that we present to solve the classes of SSP problems rely on the notion
of end component [dA97]. End components are the analogue in Markov decision
processes of the closed recurrent classes of Markov chains [KSK66]: they represent the
sets of states and actions that can be repeated infinitely often along a path with non-zero
probability. Related sets of states have been used for solving optimization problems on
MDPs [CY95]. Given an MDP $\mathcal{M} = (S, \mathit{Acts}, A, p)$, a sub-MDP is a pair $(C, D)$, where
$C \subseteq S$ is a subset of states and $D : S \mapsto 2^{\mathit{Acts}}$ is a function that associates to each $s \in S$
a subset $D(s) \subseteq A(s)$ of actions. A sub-MDP $(C, D)$ is an end component if the following
conditions hold:

• Closure: for all $s \in C$, $a \in D(s)$, and $t \in S$, if $p(s,a)(t) > 0$ then $t \in C$.

• Connectivity: Let $E = \{(s,t) \in C \times C \mid \exists a \in D(s) \,.\, p(s,a)(t) > 0\}$; then, the graph
$(C, E)$ is strongly connected.

We say that an end component $(C, D)$ is contained in a sub-MDP $(C', D')$ if

$$\{(s,a) \mid s \in C \land a \in D(s)\} \subseteq \{(s,a) \mid s \in C' \land a \in D'(s)\} .$$

We say that an end component $(C, D)$ is maximal in a sub-MDP $(C', D')$ if there is no
other end component $(C'', D'')$ contained in $(C', D')$ that properly contains $(C, D)$. We
denote by $\mathit{Mec}(C', D')$ the set of maximal end components of $(C', D')$. It is not difficult to
see that, given a sub-MDP $(C, D)$, the set $\mathit{Mec}(C, D)$ can be computed in time polynomial
in $|C| + \sum_{s \in C} |D(s)|$ using simple graph algorithms; an algorithm to do so is given in [dA97,
§3]. Given a path $\omega$, denote by $\mathit{InfS}(\omega) = \{s \in S \mid \overset{\infty}{\exists} k \,.\, X_k(\omega) = s\}$ the set of states visited
infinitely often by $\omega$, where $\overset{\infty}{\exists}$ is a shorthand for “there are infinitely many distinct”. Also,
define $\mathit{InfA}(\omega) : S \mapsto 2^{\mathit{Acts}}$ by $\mathit{InfA}(\omega)(s) = \{a \in A(s) \mid \overset{\infty}{\exists} k \,.\, X_k(\omega) = s \land Y_k(\omega) = a\}$ for all $s \in S$. The
following theorem summarizes the basic property of end components [dA97].


Theorem 1 For all $s \in S$ and all $\eta \in \mathit{Pol}$, we have

$$\Pr_s^\eta \bigl( (\mathit{InfS}(\omega), \mathit{InfA}(\omega)) \text{ is an end component} \bigr) = 1 .$$
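The set $\mathit{Mec}(C, D)$ mentioned above can be computed by alternating two steps: pruning actions that may leave the current state set, and splitting the set into strongly connected components. The sketch below follows this standard scheme; it is not claimed to be the algorithm of [dA97, §3]. It assumes the dictionary encoding `p[s][a] = {successor: probability}`, additionally requires every state of an end component to retain at least one action, and favors brevity over efficiency in the component computation. All function names are hypothetical.

```python
def _reach(start, edges, allowed):
    """States of `allowed` reachable from `start` along `edges`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sccs(nodes, succ):
    """Strongly connected components via forward/backward reachability."""
    pred = {}
    for u in nodes:
        for v in succ.get(u, ()):
            if v in nodes:
                pred.setdefault(v, set()).add(u)
    remaining, parts = set(nodes), []
    while remaining:
        u = next(iter(remaining))
        comp = _reach(u, succ, remaining) & _reach(u, pred, remaining)
        parts.append(comp)
        remaining -= comp
    return parts


def maximal_end_components(p, C, D):
    """Maximal end components of the sub-MDP (C, D) as a list of (states, actions)."""
    D = {s: set(D.get(s, ())) for s in C}
    work, result = [set(C)], []
    while work:
        T = work.pop()
        changed = True
        while changed:  # closure: drop actions leaving T, then states without actions
            changed = False
            for s in list(T):
                D[s] = {a for a in D[s]
                        if {t for t, pr in p[s][a].items() if pr > 0} <= T}
                if not D[s]:
                    T.discard(s)
                    changed = True
        if not T:
            continue
        succ = {s: {t for a in D[s] for t, pr in p[s][a].items() if pr > 0} for s in T}
        parts = sccs(T, succ)
        if len(parts) == 1:  # closed and strongly connected: an end component
            result.append((T, {s: D[s] for s in T}))
        else:
            work.extend(parts)
    return result


# Illustrative MDP (an assumption): s2 and s3 form a cycle, s4 is absorbing.
example = {
    "s1": {"c": {"s2": 1.0}},
    "s2": {"d": {"s3": 1.0}},
    "s3": {"a": {"s2": 1.0}, "b": {"s4": 1.0}},
    "s4": {"e": {"s4": 1.0}},
}
```

On `example`, calling `maximal_end_components(example, set(example), {s: set(example[s]) for s in example})` yields the two components on {s2, s3} and {s4}; action `b` is excluded from the first because it leaves the component.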


2.3  Computing  the  set  of  states  having  proper  policies 

As a first step in the solution of the SSP problem, we must compute the set

$$\mathit{Reach}(R) = \{ s \in S \mid \exists \eta \in \mathit{Pol} \,.\, \Pr_s^\eta(T_R < \infty) = 1 \}$$

consisting of the states having at least one proper policy. This problem can be solved by
reducing it to any of several well-known dynamic programming problems, such as the maximum
average reward problem [Der70] or the maximum reachability probability problem [CY90].
However,  these  reductions  yield  algorithms  that  are  based  on  linear  programming,  and 
their  time  complexity  is  only  weakly  polynomial,  i.e.  it  depends  on  the  size  of  the  bit  strings 
encoding  the  probability  values  in  the  input  description  of  the  problem.  We  present  here 
an  algorithm  that  solves  the  problem  in  time  quadratic  in  the  size  of  the  MDP,  and  that 
does  not  require  any  numerical  computation.  The  algorithm  is  originally  from  [dA97],  and 
is  related  to  an  algorithm  for  solving  reachability  problems  in  two-person  games  presented 
in [dAHK98]. The algorithm is also reminiscent of an algorithm independently proposed
in [Var95]. To present the algorithm, given two subsets $X, Y \subseteq S$ of states we define the
predicate $\mathit{APre}(Y, X)$ so that for all $s \in S$,

$$s \models \mathit{APre}(Y, X) \quad \text{iff} \quad \exists a \in A(s) \,.\, \bigl( \mathit{dest}(s,a) \subseteq Y \land \mathit{dest}(s,a) \cap X \neq \emptyset \bigr) .$$

Given a subset $R$ of target states, we compute $\mathit{Reach}(R)$ by the following $\mu$-calculus
expression:

$$\mathit{Reach}(R) = \nu Y \,.\, \mu X \,.\, (\mathit{APre}(Y, X) \lor R) , \qquad (2)$$

where we have used the slightly improper notation of denoting by $R$ a predicate that holds
exactly for the states in $R$. The algorithm (2) can be understood as follows. Denoting by
$Y_k$ the value of the set $Y$ computed at iteration $k \ge 0$, we have initially $Y_0 = S$. At the
end of the first iteration, we have $Y_1 = S \setminus C_0$, where $C_0$ is the subset of states of $S$ that
cannot reach $R$. At the end of the second iteration, we have $Y_2 = Y_1 \setminus C_1$, where $C_1$ is the
set of states that cannot reach $R$ without risking entering $C_0$. In general, at the end of
iteration $k > 0$, we have $Y_k = Y_{k-1} \setminus C_{k-1}$, where $C_{k-1}$ consists of the states that cannot
reach $R$ without risking entering $\bigcup_{i=0}^{k-2} C_i$. Given an MDP $\mathcal{M} = (S, \mathit{Acts}, A, p)$, define its graph
size $|\mathcal{M}|$ by

$$|\mathcal{M}| = \sum_{s \in S} \sum_{a \in A(s)} |\mathit{Support}(p(s,a))| .$$

The  following  theorem  summarizes  the  results  about  this  algorithm. 

Theorem 2 Given an MDP $\mathcal{M} = (S, \mathit{Acts}, A, p)$ and a set $R \subseteq S$ of target states, relation
(2) correctly computes $\mathit{Reach}(R)$ in time quadratic in $|\mathcal{M}|$.
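The nested fixpoint (2) can be evaluated directly by iteration: the outer loop shrinks $Y$ starting from $S$, and the inner loop grows $X$ starting from the empty set. The sketch below is a naive evaluation under the dictionary encoding `p[s][a] = {successor: probability}`; it makes no attempt to achieve the quadratic bound of Theorem 2, and the function names and the four-state `example` are assumptions made for illustration.

```python
def dest(p, s, a):
    """dest(s, a): successors reached with positive probability."""
    return {t for t, pr in p[s][a].items() if pr > 0}


def apre(p, Y, X):
    """States with an action that surely stays in Y and may enter X."""
    result = set()
    for s in p:
        for a in p[s]:
            d = dest(p, s, a)
            if d <= Y and d & X:
                result.add(s)
                break
    return result


def reach(p, R):
    """Reach(R) = nu Y . mu X . (APre(Y, X) or R), by naive iteration."""
    targets = set(R) & set(p)
    Y = set(p)                      # Y_0 = S
    while True:
        X = set()                   # inner least fixpoint starts from the empty set
        while True:
            X_next = apre(p, Y, X) | targets
            if X_next == X:
                break
            X = X_next
        if X == Y:                  # outer greatest fixpoint reached
            return Y
        Y = X


# Illustrative MDP (an assumption): from s0 the target t is reached with
# probability 1, while from s1 the absorbing state "bad" cannot be avoided.
example = {
    "s0": {"a": {"s0": 0.5, "t": 0.5}},
    "s1": {"a": {"t": 0.5, "bad": 0.5}},
    "bad": {"a": {"bad": 1.0}},
    "t": {"a": {"t": 1.0}},
}
```

On `example` the outer iterates are $Y_0 = S$, $Y_1 = S \setminus \{\text{bad}\}$ and $Y_2 = \{s_0, t\}$, matching the description of the sets $C_0$ and $C_1$ given above.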

Once the set $\mathit{Reach}(R)$ has been computed, we can replace the original SSP problem
$(S, \mathit{Acts}, A, p, R, c, g)$ with a new problem $(Q, \mathit{Acts}, A', p', R, c', g')$, where $Q = \mathit{Reach}(R)$,
where $p'$, $c'$, $g'$ are the restrictions of $p$, $c$, $g$ to $Q$, and where for all $s \in Q$ we let
$A'(s) = \{a \in A(s) \mid \mathit{dest}(s,a) \subseteq Q\}$. To avoid a change of notation, in the following we
denote an instance of the SSP problem again by $(S, \mathit{Acts}, A, p, R, c, g)$, but we assume
that $\mathit{Reach}(R) = S$. This is equivalent to assuming that the above reduction has been
made already.

3  Solving  Non-Negative  SSP  Problems 

The class of SSP problems that is most closely related to the non-negative class, and for
which solution methods have been presented in the literature, is discussed in [BT91, Ber95].
There, it is shown that the SSP problem can be solved under the additional assumption
that, for all $s \in S$, there is a proper policy that minimizes the total cost (1). An example
of an SSP problem in which this assumption does not hold is depicted in Figure 1. Clearly,
the policy that minimizes (1) is the policy $\eta_1$ that always chooses action $a$ at $s_3$; this
policy leads to the expected cost $v_{s_1}^{\eta_1} = 1$. However, this policy is not proper, and it is
easy to see that for every proper policy $\eta$ we have $v_{s_1}^{\eta} = 3$.

Figure 1: An instance of the SSP problem. The target set is $R = \{s_4\}$, and the terminal cost is
$g(s_4) = 0$. States are represented as nodes of a graph, and actions as edges. We have indicated
only the actions $a$ and $b$ corresponding to state $s_3$, where $A(s_3) = \{a, b\}$. In this example, all
actions (including $a$ and $b$) are deterministic, i.e. they lead to only one destination state. The
actions are labeled with their cost $c$. The two actions having cost 0 have been indicated with
dashed lines. A larger instance of the SSP problem is presented in Figure 3.

To understand why iterative approaches such as value iteration cannot be applied
immediately to this problem, let $n = |S \setminus R|$, and denote with $v = [v_s]_{s \in S \setminus R} \in \mathbb{R}^n$ a vector
of real numbers indexed by the states of $S \setminus R$. Define the Bellman operator $L : \mathbb{R}^n \mapsto \mathbb{R}^n$
on the space of $v$ by

$$[L(v)]_s = \min_{a \in A(s)} \Bigl( c(s,a) + \sum_{t \in S \setminus R} p(s,a)(t) \, v_t + \sum_{t \in R} p(s,a)(t) \, g(t) \Bigr) , \qquad s \in S \setminus R , \qquad (3)$$


where $[L(v)]_s$ denotes the $s$-component of vector $L(v)$. Given an initial vector $v^0$, the
value iteration method computes the sequence of vectors $v^0, v^1, v^2, \ldots$ by $v^{k+1} = L(v^k)$,
for $k \ge 0$, and returns as answer $\lim_{k \to \infty} v^k$, provided the limit exists. The initial vector
$v^0$ represents an initial (often arbitrary) estimate for the minimum expected reachability
cost; each iteration of the Bellman operator $L$ is aimed at improving the estimate. Clearly,
the answer returned by the value iteration procedure is a fixpoint of $L$. However, in
non-negative SSP problems the Bellman operator $L$ may admit more than one fixpoint: for
example, in the SSP problem of Figure 1, for $x \ge 0$ all vectors

$$v(x) = [v_1, v_2, v_3] = [3, 2, 2] - x \, [1, 1, 1] \qquad (4)$$

satisfy $v = L(v)$. If $L$ admits more than one fixpoint, the sequence $v^0, v^1, v^2, \ldots$ can
converge to any one of them, depending on the value of the initial estimate $v^0$. In the
example of Figure 1, starting from the initial vector $[0, 0, 0]$, the value iteration method
converges to the fixpoint $[1, 0, 0]$. However, we will prove that the solution of the SSP
problem corresponds to the largest fixpoint, which in this case is $[3, 2, 2]$. The fact that
the Bellman operator does not necessarily admit a unique fixpoint in non-negative SSP
problems not only prevents a direct application of value iteration methods, but also blocks
the line of analysis of [BT91] for the solution based on linear programming.
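The dependence of value iteration on the initial estimate can be reproduced numerically. Since Figure 1 itself is not reproduced in this text, the instance below is an assumed reconstruction chosen to be consistent with the values quoted above (the fixpoints (4) and the limit $[1,0,0]$): $s_1$ moves to $s_2$ at cost 1, $s_2$ moves to $s_3$ at cost 0, action $a$ returns from $s_3$ to $s_2$ at cost 0, and action $b$ leads from $s_3$ to the target $s_4$ at cost 2. The action names `c` and `d` and the function names are hypothetical.

```python
def bellman(v, p, c, g, R):
    """One application of the Bellman operator (3) to the vector v."""
    return {
        s: min(
            c[s, a] + sum(pr * (g[t] if t in R else v[t]) for t, pr in dist.items())
            for a, dist in p[s].items()
        )
        for s in v
    }


def value_iteration(v0, p, c, g, R, steps=50):
    """Iterate v^{k+1} = L(v^k) for a fixed number of steps."""
    v = dict(v0)
    for _ in range(steps):
        v = bellman(v, p, c, g, R)
    return v


# Assumed reconstruction of the instance of Figure 1 (see the lead-in).
R = {"s4"}
g = {"s4": 0.0}
p = {
    "s1": {"c": {"s2": 1.0}},
    "s2": {"d": {"s3": 1.0}},
    "s3": {"a": {"s2": 1.0}, "b": {"s4": 1.0}},
}
c = {("s1", "c"): 1.0, ("s2", "d"): 0.0, ("s3", "a"): 0.0, ("s3", "b"): 2.0}

from_zero = value_iteration({"s1": 0.0, "s2": 0.0, "s3": 0.0}, p, c, g, R)
largest = {"s1": 3.0, "s2": 2.0, "s3": 2.0}
```

On this instance `from_zero` is the fixpoint $[1, 0, 0]$, while `largest` is left unchanged by `bellman`, so both are fixpoints of the same operator and only the second one is the cost of a proper policy.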

We  present  two  approaches  to  the  solution  of  non-negative  SSP  problems.  The  first 
approach  is  based  on  the  observation  that  the  difficulties  in  solving  non-negative  SSP 
problems  stem  from  the  presence  in  the  SSP  problem  of  end  components  consisting  of 
state-action  pairs  having  0  cost.  If  we  remove  these  components,  we  obtain  an  equivalent 
problem  whose  Bellman  operator  has  a  unique  fixpoint;  the  problem  can  then  be  solved 
using  any  of  several  methods  that  have  been  developed  for  SSP  problems,  including  linear 
programming and value iteration. This approach has two advantages. First, it makes it
possible to exploit in the solution of the SSP problem many numerical techniques that have been
devised to handle large-sized problems. Second, the algorithm that removes the end
components often achieves a reduction of the size of the problem.

The second approach consists in reducing the SSP problem directly to linear
programming: since the solution of the linear programming problem corresponds to the greatest
fixpoint of the Bellman operator, as we will show, it also corresponds to the solution of
the  SSP  problem.  The  correctness  proof  of  this  second  approach  relies  on  an  analysis  of 
the  first  approach. 


3.1  Eliminating  0-cost  end  components 

A 0-cost end component is an end component $(C, D)$ such that $c(s,a) = 0$ for all $s \in C$
and all $a \in D(s)$. As we will show (see Theorem 3), the lack of uniqueness of the fixpoint is
due to the presence of 0-cost end components in the MDP. In a 0-cost end component $(C, D)$,
by selecting at each $s \in C$ the actions in $D(s)$ uniformly at random, we can go from any
state of $C$ to any other state of $C$ with probability 1 while incurring cost 0. Hence, the
states  of  a  0-cost  end  component  are  equivalent  from  the  point  of  view  of  the  minimum 
cost  to  the  target.  The  following  algorithm  exploits  this  fact  to  eliminate  the  0-cost  end 
components  of  an  MDP  by  replacing  them  with  single  states.  The  algorithm  opens  the 
way  to  the  use  of  iterative  methods  based  on  the  Bellman  operator  for  the  solution  of  the 
non-negative  SSP  problem. 


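Algorithm 1 (eliminating 0-cost end components)

Input: SSP problem $\Pi = (S, \mathit{Acts}, A, p, R, c, g)$.

Output: SSP problem $\hat\Pi = \mathit{ElimEC}(\Pi) = (\hat S, \widehat{\mathit{Acts}}, \hat A, \hat p, R, \hat c, \hat g)$.

Method: For each $s \in S \setminus R$, let $D(s) = \{a \in A(s) \mid c(s,a) = 0\}$, and let $\{(B_1, D_1), \ldots,
(B_n, D_n)\} = \mathit{Mec}(S \setminus R, D)$ be the set of 0-cost maximal end components that lie
outside $R$. Define $\hat S = S \cup \{\hat s_1, \ldots, \hat s_n\} \setminus \bigcup_{i=1}^{n} B_i$, where $\hat s_1, \ldots, \hat s_n$ are new states.
The action sets associated with the states are defined by:

$$\begin{array}{ll}
s \in S \setminus \bigcup_{i=1}^{n} B_i : & \hat A(s) = \{(s,a) \mid a \in A(s)\} \\
1 \le i \le n : & \hat A(\hat s_i) = \{(s,a) \mid s \in B_i \land a \in A(s) \setminus D_i(s)\} .
\end{array}$$

For $s \in \hat S$, $t \in S \setminus \bigcup_{i=1}^{n} B_i$ and $(u,a) \in \hat A(s)$, the transition probabilities are defined
by $\hat p(s,(u,a))(t) = p(u,a)(t)$ and $\hat p(s,(u,a))(\hat s_i) = \sum_{t \in B_i} p(u,a)(t)$, for $1 \le i \le n$. For $s \in \hat S$ and
$(u,a) \in \hat A(s)$ we let $\hat c(s,(u,a)) = c(u,a)$; for $s \in R$ we let $\hat g(s) = g(s)$. ∎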
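The construction of Algorithm 1 can be sketched as follows, assuming the 0-cost maximal end components have already been computed and are passed in as a list of pairs $(B_i, D_i)$. The dictionary encoding, the function name `elim_ec` and the names `ec1`, `ec2`, ... given to the new states are assumptions made for this illustration (the new names are assumed not to clash with existing states). The small instance at the end is the same assumed reconstruction of Figure 1 used for illustration only: $s_1 \to s_2$ at cost 1, a 0-cost cycle between $s_2$ and $s_3$, and action $b$ from $s_3$ to the target $s_4$ at cost 2.

```python
def elim_ec(p, c, mecs):
    """Collapse each 0-cost maximal end component (B_i, D_i) into one new state.

    p[s][a] is the distribution p(s, a); c[(s, a)] is the running cost.
    Returns (p_hat, c_hat), where actions of the new problem are pairs (u, a).
    """
    # Map every state of a component B_i to the new state replacing it.
    rep = {}
    for i, (B, _) in enumerate(mecs, start=1):
        for s in B:
            rep[s] = f"ec{i}"

    def lift(dist):
        # Redirect probability mass entering B_i to the new state ec_i.
        out = {}
        for t, pr in dist.items():
            t_hat = rep.get(t, t)
            out[t_hat] = out.get(t_hat, 0.0) + pr
        return out

    p_hat, c_hat = {}, {}
    # States outside all components keep their actions, renamed to (s, a).
    for s in p:
        if s in rep:
            continue
        p_hat[s] = {(s, a): lift(dist) for a, dist in p[s].items()}
        for a in p[s]:
            c_hat[s, (s, a)] = c[s, a]
    # Each new state offers the actions of B_i that are not part of D_i.
    for i, (B, D) in enumerate(mecs, start=1):
        new = f"ec{i}"
        p_hat[new] = {}
        for s in B:
            for a, dist in p[s].items():
                if a in D[s]:
                    continue
                p_hat[new][(s, a)] = lift(dist)
                c_hat[new, (s, a)] = c[s, a]
    return p_hat, c_hat


# Assumed reconstruction of Figure 1; the target s4 has no outgoing actions here.
p = {
    "s1": {"c": {"s2": 1.0}},
    "s2": {"d": {"s3": 1.0}},
    "s3": {"a": {"s2": 1.0}, "b": {"s4": 1.0}},
}
c = {("s1", "c"): 1.0, ("s2", "d"): 0.0, ("s3", "a"): 0.0, ("s3", "b"): 2.0}
mecs = [({"s2", "s3"}, {"s2": {"d"}, "s3": {"a"}})]
p_hat, c_hat = elim_ec(p, c, mecs)
```

In the reduced instance the only way to leave `ec1` is the pair `("s3", "b")` at cost 2, so the cost-free cycle that made the fixpoint of the Bellman operator non-unique is no longer available.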


The  algorithm  replaces  each  0-cost  end  component  (Bj,Dj)  with  a  single  new  state  Sj, 
for  1  <  i  <  n.  The  actions  associated  with  si  consist  in  all  the  pairs  (f ,  a)  such  that 
s  E  Ci  and  a  E  A(s)  is  an  action  not  belonging  to  the  end  component.  Intuitively, 
taking  action  (s,a)  at  Sj  corresponds  to  taking  action  a  from  s,  possibly  leaving  Ct .  The 
transition  probabilities  and  costs  of  the  corresponding  actions  are  unchanged,  except  that 
the  probability  of  a  transition  to  s,  is  equal  to  the  probability  of  a  transition  into  Ci  in 
the  original  system,  for  1  <  i  <  n.  The  result  of  applying  Algorithm  1  to  the  instance 
of  SSP  depicted  in  Figure  1  is  illustrated  in  Figure  2.  The  (maximal)  end  component 
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Figure  2:  Result  of  applying  Algorithm  1  to  the  instance  of  SSP  problem  depicted  in  Figure  2. 
The  new  state  si  introduced  by  the  algorithm  is  drawn  as  a  filled  circle. 


Figure 3: An instance of the SSP problem (left), and the result of applying Algorithm 1 to it (right).
Here, not all actions are deterministic, and we depict actions that can lead to more than one
destination by "bundles" of edges. To simplify the diagrams, we have indicated only the transition
probabilities corresponding to action $r$, and we have omitted all costs. The actions that have cost 0
are represented by dashed edges. The target set is $R = \{s_9\}$. The new states $\hat{s}_1$ and $\hat{s}_2$ that
have been introduced to replace the zero-cost end components are indicated by filled circles.


formed by states $s_2$, $s_3$ together with the 0-cost actions has been replaced by the single
state $\hat{s}_1$. Figure 3 depicts another application of Algorithm 1. The algorithm
computes the 0-cost end components $(B_1, D_1)$ and $(B_2, D_2)$, where the first end component is
given by $B_1 = \{s_3, s_4, s_7\}$ and $D_1(s_3)$ = {e?}, $D_1(s_4) = \{k\}$, $D_1(s_7) = \{j\}$, and the second
one by $B_2 = \{s_5, s_6\}$ and $D_2(s_5)$ = {/}, $D_2(s_6)$ = {5}. The algorithm replaces these end
components with the two new states $\hat{s}_1$ and $\hat{s}_2$. This example illustrates the potential
reduction of the state space of the system.
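
The collapsing step of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary encoding `mdp[s][a] = (cost, distribution)`, the function name `elim_ec`, the tuple `("ec", i)` used for the new state replacing $B_i$, and the three-state example are all hypothetical and not taken from the paper. The 0-cost end components are assumed to be given; computing them is not shown, and target states appear only as successors.

```python
def elim_ec(mdp, components):
    """Collapse each given 0-cost end component (B_i, D_i) into one new state.

    mdp[s][a] = (c(s, a), {t: p(s, a)(t)}) for every non-target state s.
    components = [(B_i, D_i), ...] with B_i a set of states and D_i[s] the
    set of actions of s that belong to the end component.
    """
    owner = {s: i for i, (B, _) in enumerate(components) for s in B}

    def name(t):
        # States of B_i are renamed to the new state ("ec", i).
        return ("ec", owner[t]) if t in owner else t

    def lift(u, a):
        cost, dist = mdp[u][a]
        new_dist = {}
        for t, p in dist.items():
            # Probability mass entering B_i is summed onto the new state.
            new_dist[name(t)] = new_dist.get(name(t), 0.0) + p
        return cost, new_dist

    # Surviving states keep all their actions, relabelled as pairs (s, a).
    new_mdp = {s: {(s, a): lift(s, a) for a in acts}
               for s, acts in mdp.items() if s not in owner}
    # The new state offers every action of B_i that is not in D_i.
    for i, (B, D) in enumerate(components):
        new_mdp[("ec", i)] = {(s, a): lift(s, a)
                              for s in B for a in mdp[s]
                              if a not in D.get(s, set())}
    return new_mdp


# Hypothetical instance: s2 and s3 form a 0-cost end component via b and d.
mdp = {
    "s1": {"a": (1.0, {"s2": 1.0})},
    "s2": {"b": (0.0, {"s3": 1.0}), "c": (2.0, {"t": 1.0})},
    "s3": {"d": (0.0, {"s2": 1.0}),
           "f": (3.0, {"s2": 0.5, "s3": 0.25, "s1": 0.25})},
}
components = [({"s2", "s3"}, {"s2": {"b"}, "s3": {"d"}})]
reduced = elim_ec(mdp, components)
```

In the reduced instance, action `("s3", "f")` of the new state returns to that state with probability 0.75, the total probability with which `f` entered the component in the original system.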

Once  the  0-cost  end  components  have  been  eliminated,  the  next  lemma  shows  that  the 
reduced  problem  satisfies  the  following  two  assumptions: 

SSP-1 For all $s \in S$, we have $\mathit{Prp}(s) \neq \emptyset$.

SSP-2 For all $s \in S$ and $\eta \notin \mathit{Prp}(s)$, we have $v_s^{\eta} = \infty$.

Lemma 1 Consider an instance $\Pi$ of non-negative SSP problem such that there is at
least one proper policy for each state, and let $\hat{\Pi} = \mathit{ElimEC}(\Pi)$. Then, $\hat{\Pi}$ satisfies
assumptions SSP-1 and SSP-2.

Proof. By hypothesis (or more accurately, by the algorithm presented in Section 2.3), $\hat{\Pi}$
satisfies SSP-1. By Theorem 1, the set of states and actions that are repeated infinitely
often along a path is an end component. Hence, if all 0-cost end components have been
eliminated, with probability 1 a path that does not reach $R$ has infinite cost, showing that
$\hat{\Pi}$ satisfies SSP-2. ∎

The  class  of  SSP  problems  that  satisfies  assumptions  SSP-1  and  SSP-2  has  been  studied 
in  depth  in  the  literature.  In  particular,  it  is  known  that  the  Bellman  operator  admits 
a  unique  fixpoint  for  this  class  of  problems,  and  that  there  exist  optimal  policies  that 
are memoryless [BT91]. Moreover, such problems can be solved using value-iteration and
policy-iteration methods, which converge to the solution [BT91]. Other refined iterative
methods for the solution of this class of problems are presented in [BT96]. As hinted by
Lemma  1,  the  uniqueness  of  the  fixpoint  of  the  Bellman  operator  is  related  to  the  presence 
of  0-cost  end  components. 

Theorem 3 Given a non-negative instance $(S, \mathit{Acts}, A, p, R, c, g)$ of SSP problem, the
Bellman operator $L$ admits a unique fixpoint iff there is no 0-cost end component $(C, D)$
with $C \subseteq S \setminus R$.

Proof. In one direction, assume that a non-negative instance of SSP problem does not
contain any 0-cost end component. Reasoning as for Lemma 1, we have that assumption
SSP-2 holds. If assumption SSP-1 also holds, then the uniqueness of the fixpoint follows
from [BT91]. If assumption SSP-1 does not hold, then assumption SSP-2 ensures that the
fixpoint of the Bellman operator diverges to $+\infty$ on the states where there is no proper
policy. This, together with the analysis of [BT91] for the states where there are proper
policies, ensures again the uniqueness of the fixpoint. Conversely, if there is a 0-cost end
component in a non-negative SSP problem, then we can obtain multiple fixpoints of the
Bellman operator by selecting one such end component, and by setting the value of the
fixpoint there to any negative value, as done in (4). ∎
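
The non-uniqueness caused by a 0-cost end component can be checked on a toy case. The sketch below uses a hypothetical one-state instance that is not taken from the paper: a non-target state `s` with a 0-cost self-loop and an exit action of cost 1 leading to a target `t` with $g(t) = 0$. The function name `bellman` and the candidate values are illustrative assumptions.

```python
def bellman(v_s: float) -> float:
    """One application of the Bellman operator at state s of the toy instance."""
    loop = 0.0 + 1.0 * v_s     # c(s, loop) + p(s, loop)(s) * v_s
    leave = 1.0 + 1.0 * 0.0    # c(s, exit) + p(s, exit)(t) * g(t)
    return min(loop, leave)


# Keep the candidate values that the operator leaves unchanged.
candidates = (-3.0, 0.0, 0.5, 1.0, 2.0)
fixpoints = [v for v in candidates if bellman(v) == v]
print(fixpoints)
```

Every candidate up to 1 is a fixpoint, while 2 is not. The largest fixpoint, 1, is the cost of the only proper memoryless policy of this instance, which takes the exit action immediately.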

The following theorem relates the solutions of the SSP problems $\Pi$ and $\hat{\Pi}$, and it
enables the (trivial) derivation of a solution for $\Pi$ from a solution for $\hat{\Pi}$.

Theorem 4 Consider an instance $\Pi$ of non-negative SSP such that there is at least one
proper policy for each state, and let $\hat{\Pi} = \mathit{ElimEC}(\Pi)$. Let also $B_1, \ldots, B_n$ be the 0-cost end
components that are replaced by states $\hat{s}_1, \ldots, \hat{s}_n$. Denoting by $v^*$ (resp. $\hat{v}^*$) the solution
of the SSP problems on $\Pi$ (resp. $\hat{\Pi}$), we have $v_s^* = \hat{v}_s^*$ for $s \in S \setminus \bigcup_{i=1}^{n} B_i$, and $v_s^* = \hat{v}_{\hat{s}_i}^*$
for $s \in B_i$, $1 \le i \le n$.

Even though it might appear intuitively plausible that eliminating the 0-cost end components
should not modify the solution of the SSP problem, the proof of the above theorem
is somewhat involved; it can be found in [dA97]. The same analysis also leads to the
following result.

Corollary  1  Non-negative  SSP  problems  admit  memoryless  optimal  policies. 




3.2  Linear  programming 

The  second  approach  is  given  by  the  following  theorem. 

Theorem 5 Consider an instance $\Pi$ of non-negative SSP such that there is at least
one proper policy for each state. Then, the solution $v^*$ of the SSP problem is the largest
fixpoint of operator $L$ defined in (3). Moreover, the following linear programming problem
has $v^*$ as unique solution:

Maximize $\sum_{s \in S \setminus R} v_s$ subject to

$$v_s \le c(s, a) + \sum_{t \in S \setminus R} p(s, a)(t) \, v_t + \sum_{t \in R} p(s, a)(t) \, g(t)$$

for all $s \in S \setminus R$ and $a \in A(s)$.

Proof. The theorem is proved by showing first that every fixpoint of the Bellman operator
(3) is no greater (componentwise) than the solution of the SSP problem. Next, we use the
relationship between $\Pi$ and $\hat{\Pi} = \mathit{ElimEC}(\Pi)$ to show that one of the fixpoints is equal to
the solution of the SSP problem; this implies that the solution of the SSP problem is the
largest fixpoint. Finally, it can be shown that the solution of the linear programming problem
is the largest fixpoint, and thus the solution of the SSP problem. The details can be
found in [dA97]. ∎
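
As a minimal worked illustration of the linear program of Theorem 5, consider a hypothetical one-state instance that is not taken from the paper: a single non-target state $s$ with a self-loop action of cost 0 and an exit action of cost 1 that leads to a target state $t$ with $g(t) = 0$. Under these assumptions the program reads:

```latex
\begin{align*}
\text{Maximize }   & v_s \\
\text{subject to } & v_s \le 0 + 1 \cdot v_s       && \text{(self-loop, cost 0)} \\
                   & v_s \le 1 + 1 \cdot g(t) = 1  && \text{(exit, cost 1)}
\end{align*}
```

The first constraint is vacuous, so the optimum is $v_s = 1$. This value is the largest fixpoint of $L$ on this instance, and it equals the cost of the only proper memoryless policy, which takes the exit action.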


4  Solving  Non-Positive  SSP  Problems 

Consider an instance $\Pi = (S, \mathit{Acts}, A, p, R, c, g)$ of non-positive SSP problem, and assume
that $S = \mathit{Reach}(R)$, i.e. that for every state there is a proper policy. Unlike in the
non-negative case, it is possible that $v_s^* = -\infty$ for some $s \in S \setminus R$, and the first step towards
the solution of non-positive SSP problems consists in determining the set of states from
which the minimum cost diverges to $-\infty$. This can be done with the following algorithm.


Algorithm 2

Input: A non-positive SSP problem $\Pi = (S, \mathit{Acts}, A, p, R, c, g)$, with $\mathit{Reach}(R) = S$.

Output: The subset $\mathit{Diverge}(\Pi) = \{s \mid v_s^* = -\infty\}$.

Method: Let $\mathcal{C} := \{(C, D) \in \mathit{Mec}(S \setminus R, A) \mid \exists s \in C \,.\, \exists a \in D(s) \,.\, c(s, a) < 0\}$
be the set of end components outside $R$ that have at least one strictly negative
state-action pair, and let $C = \bigcup_{(C', D') \in \mathcal{C}} C'$ be the union of their states.

Let $C_\infty = \mu X \,.\, \bigl(\neg R \wedge (\mathit{APre}(S, X) \vee C)\bigr)$ be the set of states that can reach $C$
without entering $R$.

Return: $C_\infty$. ∎
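
The final fixpoint step of Algorithm 2 can be sketched in Python as follows. This is a sketch under assumptions, not the paper's implementation: the union `C` of the states of the negative end components is taken as given (the end-component decomposition is not shown), the predecessor operator is read as "some action moves into X with positive probability", following the description "can reach C without entering R", and the encoding `trans[s][a] = {t: probability}` and the four-state example are hypothetical.

```python
from typing import Dict, Hashable, Set

State = Hashable
# trans[s][a] maps each successor t to p(s, a)(t); this encoding is an assumption.
Trans = Dict[State, Dict[str, Dict[State, float]]]


def diverge(trans: Trans, R: Set[State], C: Set[State]) -> Set[State]:
    """Least fixpoint of X = (pre(X) | C) - R, computed by iteration from the empty set."""
    X: Set[State] = set()
    while True:
        # States with some action that enters X with positive probability.
        pre = {s for s, acts in trans.items()
               if any(p > 0 and t in X
                      for dist in acts.values() for t, p in dist.items())}
        new_X = (pre | C) - R
        if new_X == X:
            return X
        X = new_X


# Hypothetical instance: "c" has a self-loop assumed to have negative cost,
# so {"c"} is a negative end component and C = {"c"}.
trans: Trans = {
    "u": {"go": {"c": 0.5, "t": 0.5}},
    "c": {"loop": {"c": 1.0}, "exit": {"t": 1.0}},
    "w": {"go": {"t": 1.0}},
    "t": {},
}
result = diverge(trans, R={"t"}, C={"c"})
```

Here `result` contains `c` and `u`: state `u` can enter the negative end component with positive probability, whereas `w` moves directly to the target.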


Theorem 6 For an instance $\Pi$ of non-positive SSP such that $S = \mathit{Reach}(R)$, we have
that $v_s^* = -\infty$ iff $s \in \mathit{Diverge}(\Pi)$.

Proof. From a state $s \in \mathit{Diverge}(\Pi)$, we can reach with positive probability an end
component in $\mathcal{C}$. Once there, we can stay in the end component arbitrarily long, accumulating
an arbitrarily large amount of negative cost, before proceeding to the target. Hence, we
have $v_s^* = -\infty$. The details can be found in [dA97]. The proof of the converse, i.e., that
if $s \notin \mathit{Diverge}(\Pi)$ then $v_s^* > -\infty$, will be given in Section 4.1. ∎

Once the set $\mathit{Diverge}(\Pi)$ has been computed, it remains to compute $v_s^*$ for $s \in S \setminus (R \cup
\mathit{Diverge}(\Pi))$. To this end, we first reduce the SSP problem by eliminating the states in
$\mathit{Diverge}(\Pi)$. We define a new instance of SSP $\tilde{\Pi} = \mathit{Converge}(\Pi) = (\tilde{S}, \mathit{Acts}, \tilde{A}, \tilde{p}, R, \tilde{c}, \tilde{g})$,
where $\tilde{S} = S \setminus \mathit{Diverge}(\Pi)$, where $\tilde{p}$, $\tilde{c}$, $\tilde{g}$ are the restrictions of $p$, $c$, $g$ to $\tilde{S}$, and where for
all $s \in \tilde{S}$ we let $\tilde{A}(s) = A(s)$. The reduced non-positive SSP problem can then be solved
in three ways: by eliminating the 0-cost end components, by linear programming, and by
value iteration.

4.1  Eliminating  0-cost  components 

The first method for solving the reduced problem consists in eliminating the 0-cost end
components using Algorithm 1 to compute $\hat{\Pi} = \mathit{ElimEC}(\tilde{\Pi})$, where $\tilde{\Pi} = \mathit{Converge}(\Pi)$
is the reduced instance. The following theorem
asserts that $\hat{\Pi}$ satisfies conditions SSP-1 and SSP-2: hence, the SSP instance $\hat{\Pi}$ can be
solved with the methods presented in [BT91, BT96, Ber95].

Theorem 7 The non-positive SSP instance $\hat{\Pi} = \mathit{ElimEC}(\tilde{\Pi})$ satisfies conditions SSP-1
and SSP-2. Moreover, let $B_1, \ldots, B_n$ be the 0-cost end components that are replaced by
states $\hat{s}_1, \ldots, \hat{s}_n$. Denoting by $\tilde{v}^*$ (resp. $\hat{v}^*$) the solution of the SSP problems on $\tilde{\Pi}$ (resp.
$\hat{\Pi}$), we have $\tilde{v}_s^* = \hat{v}_s^*$ for $s \in \tilde{S} \setminus \bigcup_{i=1}^{n} B_i$, and $\tilde{v}_s^* = \hat{v}_{\hat{s}_i}^*$ for $s \in B_i$, $1 \le i \le n$.

Proof. Since the costs are non-positive, the cost from a state never diverges to $+\infty$.
Hence, by Theorem 1, a non-positive instance satisfies condition SSP-2 iff there are no
end components entirely outside of the target $R$. To see that this condition holds for
$\hat{\Pi}$, note that the end components containing some negative cost have been eliminated
by Algorithm 2, while those consisting entirely of 0-cost state-action pairs have been
eliminated by Algorithm 1. The second part of the result is proved in an analogous way
to Theorem 4, and the proof can be found in [dA97]. ∎

Theorem 7 also leads to the second part of Theorem 6. If $s \notin \mathit{Diverge}(\Pi)$, then $s \in \tilde{S}$.
The fact that assumptions SSP-1 and SSP-2 hold for $\hat{\Pi}$, together with the results of [BT91],
ensures then that $v_s^* > -\infty$.

Theorem 8 An instance $\Pi$ of non-positive SSP problem admits memoryless optimal
(proper) policies iff $\mathit{Diverge}(\Pi) = \emptyset$. In any case, there is always a (possibly
non-memoryless) optimal proper policy.

Proof. To see that if $\mathit{Diverge}(\Pi) \neq \emptyset$, then there are no memoryless optimal policies,
refer to Algorithm 2. Since under a memoryless policy the MDP behaves like a Markov
chain, under a memoryless proper policy each path stays for a finite expected amount of
time in the end components in $\mathcal{C}$ before reaching $R$, so that $v_s^{\eta} > -\infty$ for all $s \in S \setminus R$.
On the other hand, there is a (non-memoryless) policy such that, once we reach an end
component in $\mathcal{C}$, we stay for infinite expected time in the end component (accumulating
an infinite expected cost) before reaching the target with probability 1. The proof that if
$\mathit{Diverge}(\Pi) = \emptyset$ there are memoryless optimal policies can be found in [dA97]. ∎

4.2  Linear  programming 

Reasoning  as  in  the  proof  of  Theorem  5,  it  is  possible  to  show  that  the  solution  of  the 
SSP  problem  corresponds  to  the  largest  fixpoint  of  the  Bellman  operator.  The  solution 
can  thus  be  computed  by  linear  programming. 

Theorem 9 Consider an instance $\Pi$ of non-positive SSP problem such that $\Pi = \mathit{Converge}(\Pi)$,
and such that there is at least one proper policy for each state. Then, the solution $v^*$ of
the SSP problem is the largest fixpoint of operator $L$ of (3). Moreover, the following linear
programming problem has $v^*$ as unique solution:

Maximize $\sum_{s \in S \setminus R} v_s$ subject to

$$v_s \le c(s, a) + \sum_{t \in S \setminus R} p(s, a)(t) \, v_t + \sum_{t \in R} p(s, a)(t) \, g(t)$$

for all $s \in S \setminus R$ and all $a \in A(s)$.

4.3  Value  iteration 

The third way to solve the reduced problem is by value iteration. Convergence to the
solution of the SSP problem can be ensured simply by using an initial estimate $v^0$ that is
identically 0.

Theorem 10 Consider an instance $\Pi$ of non-positive SSP such that $\Pi = \mathit{Converge}(\Pi)$,
and such that there is at least one proper policy for each state. Then, the solution of the
SSP problem is given by $\lim_{k \to \infty} L^k(\mathbf{0})$, where $\mathbf{0}$ is the vector whose entries are all 0.

Proof. The theorem follows from the fact that, in a non-positive SSP problem, all fixpoints
of the Bellman operator are componentwise less than or equal to $\mathbf{0}$. Since the limit
computed in Theorem 10 is the largest such fixpoint, by Theorem 9 it is also the solution
of the SSP problem. ∎
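
The iteration of Theorem 10 can be sketched in Python as follows. The encoding `mdp[s][a] = (cost, distribution)`, the function names, the tolerance-based stopping rule, and the two-state example are illustrative assumptions and are not taken from the paper. The dictionary `mdp` lists only the non-target states, and `g` gives the terminal cost of each target state.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
# mdp[s][a] = (c(s, a), {t: p(s, a)(t)}) for non-target states s (assumed encoding).
MDP = Dict[State, Dict[str, Tuple[float, Dict[State, float]]]]


def bellman(mdp: MDP, g: Dict[State, float], v: Dict[State, float]) -> Dict[State, float]:
    """Apply the Bellman operator once; target successors contribute g, others v."""
    return {
        s: min(cost + sum(p * (g[t] if t in g else v[t]) for t, p in dist.items())
               for cost, dist in acts.values())
        for s, acts in mdp.items()
    }


def value_iteration(mdp: MDP, g: Dict[State, float],
                    tol: float = 1e-12, max_iter: int = 10_000) -> Dict[State, float]:
    v = {s: 0.0 for s in mdp}              # initial estimate identically 0
    for _ in range(max_iter):
        new_v = bellman(mdp, g, v)
        if max(abs(new_v[s] - v[s]) for s in mdp) <= tol:
            return new_v
        v = new_v
    return v


# Hypothetical non-positive instance with a 0-cost end component at "s".
mdp: MDP = {
    "s": {"loop": (0.0, {"s": 1.0}), "exit": (-1.0, {"t": 1.0})},
    "u": {"try": (-1.0, {"u": 0.5, "t": 0.5})},
}
g = {"t": 0.0}
v = value_iteration(mdp, g)
other = {"s": -5.0, "u": -2.0}             # a smaller fixpoint of the operator
```

On this instance the iteration started from 0 approaches $v_s = -1$ and $v_u = -2$. The vector `other` is also left unchanged by `bellman`, which illustrates why the starting point matters when 0-cost end components are present.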

5  Maximum  and  Minimum  Reachability  Probabilities 

An instance $\Lambda = (S, \mathit{Acts}, A, p, T)$ of the maximum or minimum reachability problems
consists of an MDP $\Pi = (S, \mathit{Acts}, A, p)$ together with a destination set $T$. The maximum
and minimum reachability probability problems consist in determining, for all $s \in S$, the
values

$$u_s^+ = \sup_{\eta \in \mathit{Pol}} \Pr\nolimits_s^{\eta}(\exists k \,.\, X_k \in T) \qquad u_s^- = \inf_{\eta \in \mathit{Pol}} \Pr\nolimits_s^{\eta}(\exists k \,.\, X_k \in T) .$$

Let $Z \subseteq S$ be the subset of states that cannot reach $T$ (so that $u_s^+ = 0$ for $s \in Z$). From
[CY90], we know that the maximum reachability probability can be computed by solving a linear
programming problem on the set of variables $\{u_s \mid s \in S \setminus (T \cup Z)\}$. Here, we show how
our results on the SSP problem can be used to improve the efficiency of that solution, as
well as to solve the minimum reachability probability problem.

Maximum reachability probability. To reduce the maximum reachability probability
problem to the SSP problem, we construct from the instance $\Lambda$ an SSP instance $\Pi =
\mathit{Ssp}^+(\Lambda) = (S, \mathit{Acts}, A, p, R, c, g)$, where $R := \mathit{Reach}(T) \cup Z$, the cost $c$ is identically 0, and
the terminal cost is defined by $g(s) = -1$ for $s \in \mathit{Reach}(T)$, and $g(s) = 0$ for $s \in Z$. Note
that $\Pi$ is both a non-negative and a non-positive instance of SSP problem. The following
theorem relates the two problems.

Theorem 11 If $\Pi = \mathit{Ssp}^+(\Lambda)$, then $u_s^+ = -v_s^*$ for all $s \in S \setminus R$, where $u_s^+$ is computed
on $\Lambda$ and $v_s^*$ on $\Pi$.

Proof. Since every state of $S \setminus R$ can reach $R$, we have that $\mathit{Reach}(R) = S$, so that every
state of $S$ has a proper policy. From a memoryless optimal policy $\eta_s$ for the SSP problem,
we can construct a policy $\eta_r$ that coincides with $\eta_s$ on $S \setminus R$ such that $-v_s^{\eta_s} = u_s^{\eta_r}$, yielding
$-v_s^* \le u_s^+$ for all $s \in S \setminus T$. In the other direction, consider a memoryless policy $\eta_r$ optimal
for reachability (we know from [dA97] that such a policy exists). We have $u_s^+ = -v_s^{\eta_r}$ for
all $s \in S \setminus T$. Moreover, $\eta_r$ is proper, since every state of $S \setminus R$ can reach $T$ with positive
probability. This yields the reverse inequality $-v_s^* \ge u_s^+$ for all $s \in S \setminus R$, and hence the
result. ∎

Note that we have used algorithm (2) to reduce the size of the set of states on which the
maximum reachability probability must be determined, from $S \setminus (T \cup Z)$ to $S \setminus (\mathit{Reach}(T) \cup
Z)$. Theorem 11 opens the way to the application of Algorithm 1 for the solution of
maximum reachability probability problems. Since the running cost $c$ is identically 0,
the algorithm eliminates all end components of the MDP that lie completely outside of
$\mathit{Reach}(T) \cup Z$, achieving a further potential reduction in the size of the problem.
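
The reduction of Theorem 11 can be combined with value iteration from the zero vector, which applies here because the running cost is identically 0 and hence no state diverges. The Python sketch below is an illustration only: the sets `sure` (standing for Reach(T)) and `zero` (standing for Z) are assumed to be precomputed, and the encoding `mdp[s][a] = {t: probability}`, the function name, the stopping rule, and the example are hypothetical.

```python
def max_reach_prob(mdp, sure, zero, tol=1e-12, max_iter=10_000):
    """Maximum reachability probabilities for the states outside R = sure | zero.

    mdp[s][a] = {t: p(s, a)(t)} for s outside R (assumed encoding).
    """
    # Terminal costs of the SSP instance: -1 on Reach(T), 0 on Z.
    g = {**{s: -1.0 for s in sure}, **{s: 0.0 for s in zero}}
    v = {s: 0.0 for s in mdp}                       # start from the zero vector
    for _ in range(max_iter):
        new_v = {s: min(sum(p * (g[t] if t in g else v[t]) for t, p in dist.items())
                        for dist in acts.values())
                 for s, acts in mdp.items()}
        done = max(abs(new_v[s] - v[s]) for s in mdp) <= tol
        v = new_v
        if done:
            break
    return {s: -v[s] for s in mdp}                  # u+ = -v*


# Hypothetical instance; "wait" makes {"b"} a 0-cost end component outside R.
mdp = {
    "a": {"safe": {"goal": 0.5, "sink": 0.5}, "risky": {"b": 1.0}},
    "b": {"wait": {"b": 1.0}, "try": {"goal": 0.9, "sink": 0.1}},
}
u = max_reach_prob(mdp, sure={"goal"}, zero={"sink"})
```

Both states obtain maximum probability 0.9: from `a` it is better to move to `b` and attempt `try` than to take `safe`. Collapsing the end component at `b` with Algorithm 1 beforehand would remove the `wait` action without changing this result.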

Minimum reachability probability. Let $\{(C_1, D_1), \ldots, (C_n, D_n)\} = \mathit{Mec}(S \setminus T, A)$
be the set of maximal end components lying outside $T$, and let $C = \bigcup_{i=1}^{n} C_i$ be the union
of their states. Clearly, from $Z \cup C$ the minimum probability of reaching $T$ is 0. Moreover,
the MDP does not have any end component completely contained in $S \setminus (T \cup Z \cup C)$.
From the instance $\Lambda = (S, \mathit{Acts}, A, p, T)$ we construct an SSP instance $\Pi = \mathit{Ssp}^-(\Lambda) =
(S, \mathit{Acts}, A, p, R, c, g)$, where $R := T \cup Z \cup C$, the cost $c$ is identically 0, and the terminal
cost is defined by $g(s) = 0$ for $s \in Z \cup C$, and $g(s) = 1$ for $s \in T$. The following theorem
relates the two problems, and it enables the computation of the minimum probability of
reaching the target.

Theorem 12 If $\Pi = \mathit{Ssp}^-(\Lambda)$, then $u_s^- = v_s^*$ for all $s \in S$, where $u_s^-$ is computed on $\Lambda$
and $v_s^*$ is computed on $\Pi$.

Proof. The proof of the theorem follows from the fact that all policies of $\Pi$ are proper,
and from the observation that from a policy $\eta_s$ of $\Pi$, we can easily obtain a policy $\eta_r$ for
$\Lambda$ such that $u_s^{\eta_r} = v_s^{\eta_s}$ for all $s \in S$, and vice versa. ∎

In  this  case,  Algorithm  1  cannot  be  used  to  reduce  the  size  of  the  problem,  since  there 
are  no  end  components  in  S\R.  The  reduction  has  been  effected  in  a  more  direct  way  by 
adding  the  set  C  to  the  set  of  target  states  of  the  SSP  problem. 
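
The reduction of Theorem 12 can be sketched in the same style. Since all policies of the resulting instance are proper, assumptions SSP-1 and SSP-2 hold and value iteration converges to the unique fixpoint. In the Python sketch below, the set `zero` stands for $Z \cup C$ and is assumed to be precomputed; the encoding `mdp[s][a] = {t: probability}`, the function name, the stopping rule, and the example are hypothetical.

```python
def min_reach_prob(mdp, T, zero, tol=1e-12, max_iter=10_000):
    """Minimum reachability probabilities for the states outside R = T | zero.

    mdp[s][a] = {t: p(s, a)(t)} for s outside R (assumed encoding).
    """
    # Terminal costs of the SSP instance: 1 on T, 0 on Z and on the end components C.
    g = {**{s: 1.0 for s in T}, **{s: 0.0 for s in zero}}
    v = {s: 0.0 for s in mdp}
    for _ in range(max_iter):
        new_v = {s: min(sum(p * (g[t] if t in g else v[t]) for t, p in dist.items())
                        for dist in acts.values())
                 for s, acts in mdp.items()}
        done = max(abs(new_v[s] - v[s]) for s in mdp) <= tol
        v = new_v
        if done:
            break
    return v                                         # u- = v*


# Hypothetical instance: "trap" stands for a state of Z or of an end component outside T.
mdp = {"a": {"x": {"goal": 0.5, "trap": 0.5}, "y": {"goal": 0.8, "a": 0.2}}}
u_min = min_reach_prob(mdp, T={"goal"}, zero={"trap"})
```

Action `y` reaches the goal with probability 1 if repeated forever, so the minimizing choice at `a` is `x`, and the computed value is 0.5.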

Optimal policies. The maximum and minimum reachability problems admit memoryless
optimal policies. This result is proved in [dA97].